addons: Reduce memory consumption #2395
Parse dump files incrementally using ElementTree.iterparse and release unused resources during parsing. This method is explained in the following article: https://www.ibm.com/developerworks/xml/library/x-hiperfparse/ Memory consumption was reduced by about 30% (measured with mprof); execution time increased by about 5% (measured with the time utility). A more detailed description is available in the PR.
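For readers unfamiliar with the technique, here is a minimal sketch of incremental parsing with the standard library's `ElementTree.iterparse`. It is not the code from this PR: the sample document, the `token` element, the `cfg` and `str` attributes, and the helper name `iter_token_strings` are simplified assumptions chosen for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature dump; real Cppcheck dump files are far larger
# and contain many more element types.
SAMPLE = b"""<dumps>
  <dump cfg="A"><token id="1" str="int"/><token id="2" str="x"/></dump>
  <dump cfg="B"><token id="3" str="return"/></dump>
</dumps>"""


def iter_token_strings(source):
    """Yield (cfg, token string) pairs without keeping processed data around."""
    cfg = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == "dump":
            # Attributes are already available on the "start" event.
            cfg = elem.get("cfg")
        elif event == "end" and elem.tag == "token":
            yield cfg, elem.get("str")
            # Drop the attributes of the token we have just processed.
            elem.clear()
        elif event == "end" and elem.tag == "dump":
            # Drop the references to all children of the finished dump.
            elem.clear()


tokens = list(iter_token_strings(io.BytesIO(SAMPLE)))
print(tokens)
```

The important part is calling `clear()` as soon as an element has been processed; without it, `iterparse` still builds the complete tree in memory.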
|
This really looks very promising. Good find! |
Use the lxml module instead of the default xml.etree. lxml provides a convenient wrapper around the iterparse method that accepts a `tag` argument. That makes it easier for the incremental parsing routines to select specific tags from the root tree, such as `dump` and `dumps`. The Element.clear() method was replaced by `lxml_clean` because lxml attaches additional information to nodes, and that information must be removed as well. Added a note about high RAM consumption on large dump files. This commit doesn't solve this problem completely, but it provides a way to improve the current parser so that incremental Configuration serialization can be added later.
No, that did not make sense before, because we were using the standard xml.etree.ElementTree instead of lxml as suggested in the article. As I understand, lxml somehow extends Now I added the lxml module as suggested in the article because their It is because we don't iterate over all the unnecessary nodes as here. And it is faster than the current master implementation, where we iterate over the whole tree twice. So, can we add |
|
These changes reduce execution time and memory consumption in comparison with master. All tests will pass if lxml is added. But in fact, this does not solve the RAM problem completely. What happens if we try to check a really large dump file? Consider the 430+ MB dump of jcdctmgr.c:
git clone https://github.com/mozilla/mozjpeg && cd mozjpeg
$INSTALL_PATH/bin/cppcheck -j4 --enable=all . --dump
du -h jcdctmgr.c.dump # 431M
If we run About 8 GB RAM when loading variables and tokens during What should we do about this problem? Maybe check the size of the XML dump before loading and, if it requires more RAM than is available, use Any suggestions? |
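The size check proposed above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the thread does not settle on a threshold or on what to fall back to, so `LARGE_DUMP_BYTES`, `choose_parser` and the labels `"full"` and `"incremental"` are hypothetical names, not part of the Cppcheck addons.

```python
import os
import tempfile

# Hypothetical threshold; the discussion does not name a concrete number.
LARGE_DUMP_BYTES = 100 * 1024 * 1024


def choose_parser(path, threshold=LARGE_DUMP_BYTES):
    """Return "incremental" for dumps at or above the threshold, else "full"."""
    size = os.path.getsize(path)
    return "incremental" if size >= threshold else "full"


# Demonstration on a tiny temporary file standing in for a dump.
with tempfile.NamedTemporaryFile(suffix=".dump", delete=False) as tmp:
    tmp.write(b"<dumps/>")
    demo_path = tmp.name
try:
    default_choice = choose_parser(demo_path)
    forced_choice = choose_parser(demo_path, threshold=4)
finally:
    os.remove(demo_path)

print(default_choice, forced_choice)
```

File size is only a rough proxy: as the 431M dump above shows, the parsed Python objects can take many times more RAM than the XML on disk.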
|
I see. I did not have in mind that lxml is not used. IMHO it would be best if users did not have to install extra Python modules to use the addons. That would make the addons more complicated to use than they are now. Personally, I would only use the modules that are already available. But I cannot decide this. |
Unfortunately, no. If we don't have enough memory to load the whole XML file, we'll die at this place. You can check it with the large dump file from my post above.
I agree, it can complicate the installation process. OK, I'll try to find other ways to do iterparse without lxml. |
It is not a requirement to use xml.. feel free to suggest some alternative. |
|
I think this sounds like a major refactoring that would be better to merge after the release. |
When using |
sounds ok to me. |
|
@versat you were absolutely right to mention I rewrote the iterative parser correctly (without third-party modules, with correct resource cleanup) and, sigh, it didn't solve this problem. With this parser we'll save only So I was wondering what really requires so much memory. Obviously, the problem is in the serialized Python objects. So I used the convenient pympler module to measure the memory usage of particular objects. Here are some results with As we see, most of the memory is consumed by the configurations list. So the solution seems pretty straightforward. If we see that our dump file is pretty large, we should return an iterator over the configurations. The values for a particular configuration would be parsed on demand. The size of metadata from Further experiments will take some time; I'll continue working on it. P.S. Actually, I'm not interested in the mozjpeg project at all. It was just a random GitHub repo that crashed my small script, which scrapes GitHub repos and tries to run Cppcheck addons on them. |
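The idea of returning an iterator over configurations, rather than a fully built list, can be sketched as follows. This is an illustration under stated assumptions and not the parser from this PR: the `Configuration` class, the `iter_configurations` helper, and the `token`, `cfg` and `str` names in the sample are hypothetical simplifications.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature dump with two configurations.
SAMPLE = b"""<dumps>
  <dump cfg="A"><token str="int"/><token str="x"/></dump>
  <dump cfg="B"><token str="return"/></dump>
</dumps>"""


class Configuration(object):
    """Hypothetical stand-in for the data of one configuration."""

    def __init__(self, name, tokens):
        self.name = name
        self.tokens = tokens


def iter_configurations(source):
    """Yield one Configuration at a time instead of building a full list."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "dump":
            # At this point only the subtree of the current dump is populated.
            tokens = [t.get("str") for t in elem.iter("token")]
            yield Configuration(elem.get("cfg"), tokens)
            # Release the subtree before the next configuration is parsed.
            elem.clear()


names = []
for cfg in iter_configurations(io.BytesIO(SAMPLE)):
    names.append((cfg.name, len(cfg.tokens)))
print(names)
```

With a generator like this, the caller holds at most one configuration at a time, provided it does not collect the yielded objects into a list itself.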
|
Very interesting, thanks for sharing these findings. I am not sure if it makes sense to implement different approaches for parsing the |
I'm still using Python 2.7.15 for all experiments to make sure that we keep backward compatibility. Undoubtedly, using the latest Python 3 versions would improve performance and optimize resource usage. Dataclasses (introduced in Python 3.7) would be especially useful for us. It would be interesting to explore other mechanisms like I'm not yet sure about configurations parsing; I need to try that first and write tests later. |
|
I finished the iterative parser for Here are my results: Environment: Debian 10.2, Python 3.7.3 and 2.7.16. Target machine: i5-3230M / 16 GiB DDR3-1600 RAM. I use dump file @danmar @versat could you please review these changes? I would also like to squash my commits and write a good commit message before merging, if you accept it. |
|
I have not looked too deeply into the code but from what I have seen, it looks fine. |
That is OK! |
|
@danmar I suppose it can be merged now. |
|
Let's try it! |

Parse dump files incrementally using ElementTree.iterparse. Clean unused resources during parsing. This method is explained in the following article: https://www.ibm.com/developerworks/xml/library/x-hiperfparse/
Memory consumption was reduced by about 30% (measured with mprof), execution time increased by about 5% (measured with the time utility).
Resulting graphs created with mprof: master, this PR
Python version: 2.7.15.
I used a 20 MB dump file created from this source.